polyglot
===============================

[![Downloads](https://img.shields.io/pypi/dm/polyglot.svg "Downloads")](https://pypi.python.org/pypi/polyglot)
[![Latest Version](https://badge.fury.io/py/polyglot.svg "Latest Version")](https://pypi.python.org/pypi/polyglot)
[![Build Status](https://travis-ci.org/aboSamoor/polyglot.png?branch=master "Build Status")](https://travis-ci.org/aboSamoor/polyglot)
[![Documentation Status](https://readthedocs.org/projects/polyglot/badge/?version=latest "Documentation Status")](https://readthedocs.org/builds/polyglot/)

Polyglot is a natural language pipeline that supports massively multilingual applications.

* Free software: GPLv3 license
* Documentation: http://polyglot.readthedocs.org.

### Features


* Tokenization (165 Languages)
* Language detection (196 Languages)
* Named Entity Recognition (40 Languages)
* Part of Speech Tagging (16 Languages)
* Sentiment Analysis (136 Languages)
* Word Embeddings (137 Languages)
* Morphological analysis (135 Languages)
* Transliteration (69 Languages)

### Developer

* Rami Al-Rfou @ `rmyeid gmail com`


## Quick Tutorial

In [9]:
import polyglot
from polyglot.text import Text, Word

### Language Detection

In [10]:
text = Text("Bonjour, Mesdames.")
print("Language Detected: Code={}, Name={}\n".format(text.language.code, text.language.name))

Language Detected: Code=fr, Name=French



### Tokenization

In [11]:
zen = Text("Beautiful is better than ugly. "
           "Explicit is better than implicit. "
           "Simple is better than complex.")
print(zen.words)

[u'Beautiful', u'is', u'better', u'than', u'ugly', u'.', u'Explicit', u'is', u'better', u'than', u'implicit', u'.', u'Simple', u'is', u'better', u'than', u'complex', u'.']


In [12]:
print(zen.sentences)

[Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]


### Part of Speech Tagging

In [13]:
text = Text(u"O primeiro uso de desobediência civil em massa ocorreu em setembro de 1906.")

print("{:<16}{}".format("Word", "POS Tag")+"\n"+"-"*30)
for word, tag in text.pos_tags:
    print(u"{:<16}{:>2}".format(word, tag))

Word            POS Tag
------------------------------
O               DET
primeiro        ADJ
uso             NOUN
de              ADP
desobediência   NOUN
civil           ADJ
em              ADP
massa           NOUN
ocorreu         ADJ
em              ADP
setembro        NOUN
de              ADP
1906            NUM
.               PUNCT
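
Because `pos_tags` yields `(word, tag)` pairs, as the loop above shows, the result can be post-processed with ordinary Python. The following is a minimal sketch, not part of polyglot: it hard-codes the pairs printed above so that it runs without the library or its models, and the names `tag_counts` and `nouns` are illustrative only.

```python
from collections import Counter

# (word, tag) pairs copied from the output above. With polyglot installed,
# list(text.pos_tags) is assumed to yield pairs of the same shape.
pos_tags = [
    ("O", "DET"), ("primeiro", "ADJ"), ("uso", "NOUN"), ("de", "ADP"),
    ("desobediência", "NOUN"), ("civil", "ADJ"), ("em", "ADP"),
    ("massa", "NOUN"), ("ocorreu", "ADJ"), ("em", "ADP"),
    ("setembro", "NOUN"), ("de", "ADP"), ("1906", "NUM"), (".", "PUNCT"),
]

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in pos_tags)

# Keep only the words tagged as nouns, in sentence order.
nouns = [word for word, tag in pos_tags if tag == "NOUN"]

print(tag_counts.most_common(3))
print(nouns)
```

The same pattern works for any filter on the tag, for example collecting all `ADP` tokens.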


### Named Entity Recognition

In [14]:
text = Text(u"In Großbritannien war Gandhi mit dem westlichen Lebensstil vertraut geworden")
print(text.entities)

[I-LOC([u'Gro\xdfbritannien']), I-PER([u'Gandhi'])]


### Polarity

In [15]:
print("{:<16}{}".format("Word", "Polarity")+"\n"+"-"*30)
for w in zen.words[:6]:
    print("{:<16}{:>2}".format(w, w.polarity))

Word            Polarity
------------------------------
Beautiful        0
is               0
better           1
than             0
ugly            -1
.                0
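
The word-level scores above are -1, 0, or 1, so a sentence-level score needs some aggregation rule. The sketch below shows one simple option: averaging the non-neutral words. It is an illustration rather than polyglot's own method; `mean_polarity` is a hypothetical helper, and the scores are copied from the output above so the block runs without the library.

```python
def mean_polarity(scores):
    """Average the non-neutral scores; return 0.0 if every word is neutral."""
    non_neutral = [s for s in scores if s != 0]
    if not non_neutral:
        return 0.0
    return sum(non_neutral) / len(non_neutral)


# (word, polarity) pairs copied from the output above.
word_polarity = [
    ("Beautiful", 0), ("is", 0), ("better", 1),
    ("than", 0), ("ugly", -1), (".", 0),
]

score = mean_polarity(p for _, p in word_polarity)
print(score)
```

Ignoring neutral words keeps function words and punctuation from diluting the score; here the one positive and one negative word cancel out.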


### Embeddings

In [19]:
word = Word("Obama", language="en")
print("Neighbors (Synonyms) of {}".format(word)+"\n"+"-"*30)
for w in word.neighbors:
    print("{:<16}".format(w))
print("\n\nThe first 10 dimensions out of the {} dimensions\n".format(word.vector.shape[0]))
print(word.vector[:10])

Neighbors (Synonyms) of Obama
------------------------------
Bush
Reagan
Clinton
Ahmadinejad
Nixon
Karzai
McCain
Biden
Huckabee
Lula


The first 10 dimensions out of the 256 dimensions

[-2.57382345 1.52175975 0.51070285 1.08678675 -0.74386948 -1.18616164
 2.92784619 -0.25694436 -1.40958667 -2.39675403]
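
Neighbors are words whose vectors lie close to the query word's vector. The sketch below ranks words by cosine similarity to illustrate the idea. It makes two assumptions: the distance measure polyglot uses internally may differ, and the three-dimensional `toy_vectors` are made up for the example (they are not real embeddings). `cosine_similarity` and `nearest` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Return the k words most similar to `query`, excluding the query itself."""
    others = [w for w in vectors if w != query]
    others.sort(
        key=lambda w: cosine_similarity(vectors[query], vectors[w]),
        reverse=True,
    )
    return others[:k]


# Made-up vectors for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
    "pear": [0.2, 0.1, 0.8],
}

print(nearest("king", toy_vectors, k=1))
```

With real 256-dimensional vectors the ranking logic stays the same; only the vector lookup changes.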


### Morphology

In [17]:
word = Text("Preprocessing is an essential step.").words[0]
print(word.morphemes)

[u'Pre', u'process', u'ing']


### Transliteration

In [18]:
from polyglot.transliteration import Transliterator
transliterator = Transliterator(source_lang="en", target_lang="ru")
print(transliterator.transliterate(u"preprocessing"))

препрокессинг
